[main][test] Refactor the mtp and eagle test case by lilinsiman · Pull Request #5326 · vllm-project/vllm-ascend

lilinsiman · 2025-12-24T08:31:28Z

What this PR does / why we need it?

1. Refactor the existing MTP and Eagle test cases.
2. Add the necessary new MTP and Eagle test cases.

Does this PR introduce any user-facing change?

no

How was this patch tested?

ut

vLLM version: release/v0.13.0
vLLM main: vllm-project/vllm@5fbfa8d

gemini-code-assist

Code Review

This pull request refactors the MTP and Eagle speculative decoding test cases, which is a good improvement for maintainability. However, I've identified several areas with code duplication and misuse of context managers. Specifically, there are redundant calls to cleanup functions and del statements for objects managed by with statements. These issues affect code clarity and could lead to subtle bugs. I've provided suggestions to refactor the duplicated code and remove the unnecessary calls.

gemini-code-assist · 2025-12-24T08:34:01Z

+        print(f"golden: {golden}")
+
+    assert match
+    cleanup_dist_env_and_memory()


The VllmRunner context manager automatically handles resource cleanup, including calling cleanup_dist_env_and_memory(), in its __exit__ method. This explicit call is redundant and should be removed to improve code clarity.
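
To illustrate the point, here is a minimal sketch using a hypothetical stand-in runner. `FakeRunner` and `cleanup` are invented for illustration and are not VllmRunner's actual implementation; the sketch only assumes, as the comment states, that cleanup happens in `__exit__`.

```python
cleanup_calls = []  # records each cleanup invocation


def cleanup():
    """Hypothetical stand-in for cleanup_dist_env_and_memory()."""
    cleanup_calls.append("cleanup")


class FakeRunner:
    """Hypothetical runner that performs its cleanup in __exit__."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        cleanup()
        return False  # do not swallow exceptions raised in the body

    def generate(self, prompts):
        return [p.upper() for p in prompts]


def run_test():
    with FakeRunner() as runner:
        outputs = runner.generate(["a", "b"])
    # Cleanup has already run at this point; no explicit call is needed.
    return outputs
```

If the runner behaves this way, a second explicit cleanup call after the `with` block only repeats work that `__exit__` has already done.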

gemini-code-assist · 2025-12-24T08:34:01Z

+    if not enforce_eager:
+        if cudagraph_mode == "FULL_DECODE_ONLY":
+            pytest.skip("This case will be supported in future")
+        with VllmRunner(model_name,
+                        tensor_parallel_size=4,
+                        max_model_len=4096,
+                        gpu_memory_utilization=0.8,
+                        distributed_executor_backend="mp",
+                        speculative_config={
+                            "method":
+                            "mtp",
+                            "num_speculative_tokens":
+                            num_speculative_tokens,
+                            "disable_padded_drafter_batch":
+                            disable_padded_drafter_batch,
+                        },
+                        enforce_eager=enforce_eager,
+                        compilation_config=CompilationConfig(
+                            cudagraph_mode=cudagraph_mode,
+                            cudagraph_capture_sizes=[12],
+                        )) as spec_llm:
+            spec_outputs = spec_llm.generate_greedy(example_prompts, max_tokens)
+        del spec_llm
+
+    else:
+        if cudagraph_mode == "PIECEWISE":
+            pytest.skip("skipping the repeating case")
+        with VllmRunner(model_name,
+                        tensor_parallel_size=4,
+                        max_model_len=4096,
+                        gpu_memory_utilization=0.8,
+                        distributed_executor_backend="mp",
+                        speculative_config={
+                            "method":
+                            "mtp",
+                            "num_speculative_tokens":
+                            num_speculative_tokens,
+                            "disable_padded_drafter_batch":
+                            disable_padded_drafter_batch,
+                        },
+                        enforce_eager=enforce_eager,
+                        ) as spec_llm:
+            spec_outputs = spec_llm.generate_greedy(example_prompts, max_tokens)
+        del spec_llm


The if/else block for enforce_eager contains a lot of duplicated code for initializing VllmRunner. This can be refactored to reduce redundancy and improve readability. Additionally, the del spec_llm calls are unnecessary when using a with statement, as the context manager handles cleanup.

compilation_config = None
if not enforce_eager:
    if cudagraph_mode == "FULL_DECODE_ONLY":
        pytest.skip("This case will be supported in future")
    compilation_config = CompilationConfig(
        cudagraph_mode=cudagraph_mode,
        cudagraph_capture_sizes=[12],
    )
else:
    if cudagraph_mode == "PIECEWISE":
        pytest.skip("skipping the repeating case")

with VllmRunner(model_name,
                tensor_parallel_size=4,
                max_model_len=4096,
                gpu_memory_utilization=0.8,
                distributed_executor_backend="mp",
                speculative_config={
                    "method": "mtp",
                    "num_speculative_tokens": num_speculative_tokens,
                    "disable_padded_drafter_batch": disable_padded_drafter_batch,
                },
                enforce_eager=enforce_eager,
                compilation_config=compilation_config) as spec_llm:
    spec_outputs = spec_llm.generate_greedy(example_prompts, max_tokens)
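
The skip conditions in the suggestion above decide which parameter combinations actually run. The following sketch enumerates them in plain Python. The two parameter lists are assumptions chosen to match the mode strings that appear in the snippet; the PR's real `parametrize` values are not shown in this comment.

```python
from itertools import product

# Assumed parameter values, for illustration only.
ENFORCE_EAGER = [True, False]
CUDAGRAPH_MODES = ["PIECEWISE", "FULL_DECODE_ONLY"]


def is_skipped(enforce_eager, cudagraph_mode):
    """Mirror the two pytest.skip conditions from the suggestion."""
    if not enforce_eager:
        # Graph mode: FULL_DECODE_ONLY is not supported yet.
        return cudagraph_mode == "FULL_DECODE_ONLY"
    # Eager mode ignores cudagraph_mode, so one of the modes is a repeat.
    return cudagraph_mode == "PIECEWISE"


def cases_that_run():
    return [
        (eager, mode)
        for eager, mode in product(ENFORCE_EAGER, CUDAGRAPH_MODES)
        if not is_skipped(eager, mode)
    ]
```

Under these assumed lists, only two of the four combinations execute: one eager case and one PIECEWISE graph case.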

gemini-code-assist · 2025-12-24T08:34:01Z

+    # Heuristic: expect at least 66% of the prompts to match exactly
+    # Upon failure, inspect the outputs to check for inaccuracy.
+    assert matches > int(0.66 * len(ref_outputs))
+    cleanup_dist_env_and_memory()


The VllmRunner context manager automatically handles resource cleanup, including calling cleanup_dist_env_and_memory(), in its __exit__ method. This explicit call is redundant and should be removed.
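
As a side note on the heuristic in the snippet above, the comparison is strict and the product is truncated by `int`, so the number of matches required depends on rounding. A small sketch of the arithmetic follows; the helper names are hypothetical, and the way the real test counts `matches` is not shown in this diff.

```python
def count_exact_matches(ref_outputs, spec_outputs):
    """Count positions where the reference and speculative outputs agree."""
    return sum(1 for ref, spec in zip(ref_outputs, spec_outputs) if ref == spec)


def passes_threshold(matches, total, ratio=0.66):
    # Same comparison as the test: strictly greater than the truncated product.
    return matches > int(ratio * total)
```

For 10 prompts, `int(0.66 * 10)` is 6, so at least 7 exact matches are needed; for 3 prompts, at least 2 are needed.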

gemini-code-assist · 2025-12-24T08:34:01Z

+    if not enforce_eager:
+        with VllmRunner(model_name,
+                        tensor_parallel_size=1,
+                        max_num_seqs=256,
+                        gpu_memory_utilization=0.7,
+                        distributed_executor_backend="mp",
+                        enable_expert_parallel=True,
+                        speculative_config={
+                            "method":
+                            "mtp",
+                            "num_speculative_tokens":
+                            num_speculative_tokens,
+                            "disable_padded_drafter_batch":
+                            disable_padded_drafter_batch,
+                        },
+                        enforce_eager=enforce_eager,
+                        max_model_len=2000,
+                        compilation_config=CompilationConfig(
+                            cudagraph_mode=cudagraph_mode,
+                            cudagraph_capture_sizes=[12],
+                        )) as spec_llm:
+            sampling_config = SamplingParams(temperature=0, max_tokens=256, ignore_eos=False)
+            spec_outputs = spec_llm.generate(example_prompts, sampling_config)
+
+    else:
+        if cudagraph_mode == "PIECEWISE":
+            pytest.skip("skipping the repeating case")
+        with VllmRunner(model_name,
+                        tensor_parallel_size=1,
+                        max_num_seqs=256,
+                        gpu_memory_utilization=0.7,
+                        distributed_executor_backend="mp",
+                        enable_expert_parallel=True,
+                        speculative_config={
+                            "method":
+                            "mtp",
+                            "num_speculative_tokens":
+                            num_speculative_tokens,
+                            "disable_padded_drafter_batch":
+                            disable_padded_drafter_batch,
+                        },
+                        enforce_eager=enforce_eager,
+                        max_model_len=2000
+                        ) as spec_llm:
+            sampling_config = SamplingParams(temperature=0, max_tokens=256, ignore_eos=False)
+            spec_outputs = spec_llm.generate(example_prompts, sampling_config)


This if/else block for enforce_eager has a lot of duplicated code for VllmRunner initialization. It can be refactored to be more concise and maintainable.

compilation_config = None
if not enforce_eager:
    compilation_config = CompilationConfig(
        cudagraph_mode=cudagraph_mode,
        cudagraph_capture_sizes=[12],
    )
else:
    if cudagraph_mode == "PIECEWISE":
        pytest.skip("skipping the repeating case")

with VllmRunner(model_name,
                tensor_parallel_size=1,
                max_num_seqs=256,
                gpu_memory_utilization=0.7,
                distributed_executor_backend="mp",
                enable_expert_parallel=True,
                speculative_config={
                    "method": "mtp",
                    "num_speculative_tokens": num_speculative_tokens,
                    "disable_padded_drafter_batch": disable_padded_drafter_batch,
                },
                enforce_eager=enforce_eager,
                max_model_len=2000,
                compilation_config=compilation_config) as spec_llm:
    sampling_config = SamplingParams(temperature=0, max_tokens=256, ignore_eos=False)
    spec_outputs = spec_llm.generate(example_prompts, sampling_config)
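
An alternative way to remove the duplication is to collect the shared arguments in one place and add the graph-specific entry only when it applies. This is a sketch only: `build_runner_kwargs` is a hypothetical helper, and a plain dict stands in for `CompilationConfig` so the example runs without vLLM installed. Unlike the suggestion above, it omits `compilation_config` entirely in eager mode, which matches the original eager branch.

```python
def build_runner_kwargs(enforce_eager, cudagraph_mode, num_speculative_tokens,
                        disable_padded_drafter_batch):
    """Hypothetical helper: build the keyword arguments shared by both branches."""
    kwargs = {
        "tensor_parallel_size": 1,
        "max_num_seqs": 256,
        "gpu_memory_utilization": 0.7,
        "distributed_executor_backend": "mp",
        "enable_expert_parallel": True,
        "speculative_config": {
            "method": "mtp",
            "num_speculative_tokens": num_speculative_tokens,
            "disable_padded_drafter_batch": disable_padded_drafter_batch,
        },
        "enforce_eager": enforce_eager,
        "max_model_len": 2000,
    }
    if not enforce_eager:
        # Stand-in for CompilationConfig(...); only passed in graph mode.
        kwargs["compilation_config"] = {
            "cudagraph_mode": cudagraph_mode,
            "cudagraph_capture_sizes": [12],
        }
    return kwargs
```

A test body could then call `VllmRunner(model_name, **kwargs)` once, keeping the `pytest.skip` check before the runner is constructed.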

gemini-code-assist · 2025-12-24T08:34:02Z

+    cleanup_dist_env_and_memory()
+    del spec_llm


The VllmRunner context manager handles resource cleanup. The explicit calls to cleanup_dist_env_and_memory() and del spec_llm are redundant and should be removed.

gemini-code-assist · 2025-12-24T08:34:02Z

+        cleanup_dist_env_and_memory()
+        del llm


The VllmRunner context manager (with statement) automatically handles cleanup. The explicit calls to cleanup_dist_env_and_memory() and del llm are redundant. Placing them inside the with block is particularly confusing and they should be removed.
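
The sketch below shows why `del` inside the block has no effect on cleanup: it only unbinds the local name, while the `with` statement keeps its own reference to the context manager and still calls `__exit__`. `TracingRunner` is a hypothetical class used only to record the order of events.

```python
events = []  # records the order in which things happen


class TracingRunner:
    """Hypothetical context manager that logs enter and exit."""

    def __enter__(self):
        events.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        events.append("exit")
        return False


def run():
    with TracingRunner() as llm:
        events.append("body")
        del llm  # unbinds the local name only; the object stays alive
        events.append("after del")
    return list(events)
```

The exit step still runs after the block, regardless of the `del`.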

gemini-code-assist · 2025-12-24T08:34:02Z

+        cleanup_dist_env_and_memory()
+        del llm


The VllmRunner context manager (with statement) automatically handles cleanup. The explicit calls to cleanup_dist_env_and_memory() and del llm are redundant and should be removed.

wangxiyuan · 2025-12-24T10:37:43Z

          VLLM_WORKER_MULTIPROC_METHOD: spawn
        if: ${{ inputs.type == 'full' }}
        run: |
+          pytest -sv --durations=0 tests/e2e/multicard/spec_decode_v1/test_mtp_qwen3_next.py


move to 4-card test part

github-actions · 2025-12-24T11:25:59Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

wangxiyuan · 2025-12-25T02:31:48Z

+
+@pytest.mark.parametrize("model_name", MODELS)
+@pytest.mark.parametrize("num_speculative_tokens", [1,2,3])
+@pytest.mark.parametrize("enforce_eager", [True, False])


remove enforce_eager=True

wangxiyuan · 2025-12-25T02:34:46Z

+@pytest.mark.parametrize(
+    "cudagraph_mode",
+    [
+        CUDAGraphMode.NONE,


wangxiyuan · 2025-12-25T02:35:48Z

+    [
+        CUDAGraphMode.NONE,
+        CUDAGraphMode.PIECEWISE,
+        CUDAGraphMode.FULL_DECODE_ONLY,


add it back once FULL_DECODE_ONLY works

wangxiyuan · 2025-12-25T02:42:40Z

+@pytest.mark.parametrize("method", ["eagle", "eagle3"])
+@pytest.mark.parametrize("disable_padded_drafter_batch", [True, False])
+@pytest.mark.parametrize("async_scheduling", [True, False])
+def test_offline_eagle_correctness(model_name: str,


test_<model_name>

test_deepseek_mtp_accuracy

github-actions · 2025-12-26T06:33:30Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

wjunLu · 2025-12-26T09:34:17Z

There are some conflicts, please rebase

github-actions · 2025-12-27T16:25:53Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions · 2025-12-29T02:05:36Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions · 2025-12-30T12:58:37Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

lilinsiman · 2025-12-31T00:33:09Z

CI has already passed.
e2e-Full:
multicard-2: https://github.com/vllm-project/vllm-ascend/actions/runs/20591401008/job/59137103095?pr=5326
singlecard: https://github.com/vllm-project/vllm-ascend/actions/runs/20591401008/job/59137103089?pr=5326
multicard-4: https://github.com/vllm-project/vllm-ascend/actions/runs/20596450900/job/59152056543?pr=5326

gemini-code-assist Bot reviewed Dec 24, 2025

wangxiyuan added the ready and ready-for-test labels Dec 24, 2025

wangxiyuan reviewed Dec 24, 2025

wangxiyuan removed the ready-for-test label Dec 24, 2025

github-actions Bot added ci/build module:tests labels Dec 24, 2025

wangxiyuan reviewed Dec 25, 2025

MengqingCao added the ready-for-test label Dec 25, 2025

github-actions Bot added the merge-conflicts label Dec 26, 2025

github-actions Bot added merge-conflicts and removed merge-conflicts labels Dec 27, 2025

github-actions Bot added merge-conflicts and removed merge-conflicts labels Dec 29, 2025

lilinsiman added 2 commits December 29, 2025 11:13

[main][test] Refactor the mtp and eagle test case

ebff39f

Signed-off-by: lilinsiman <lilinsiman@gmail.com>

Optimized some cases and addressed formatting issues

13318fa

Signed-off-by: lilinsiman <lilinsiman@gmail.com>

github-actions Bot removed the merge-conflicts label Dec 29, 2025

slippersss mentioned this pull request Dec 29, 2025

[RFC]: Refactor and unify eagle_proposer and mtp_proposer #5467

Closed

github-actions Bot added the merge-conflicts label Dec 30, 2025

resolve the conflicts for fullgraph

cc7e504

Signed-off-by: lilinsiman <lilinsiman@gmail.com>

github-actions Bot removed the merge-conflicts label Dec 31, 2025

wangxiyuan merged commit 46862ce into vllm-project:main Dec 31, 2025
14 checks passed
